Abstract

The explosion of available data along with the need 
to integrate and utilize that data has led to a press- 
ing interest in data integration techniques. In terms 
of Semantic Web technologies, Ontology Align- 
ment is a key step in the process of integrating 
heterogeneous knowledge bases. In this paper, we 
present the Edge Confidence technique, a modifica- 
tion and improvement over the popular Similarity 
Flooding technique for Ontology Alignment. 

1 INTRODUCTION 

As database technologies become increasingly 
diverse, the need to integrate those technologies 
has become ever more important. The heteroge- 
neous data problem describes a common situation 
in which multiple data sources with incompatible 
descriptions and data types must be integrated for 
use by a single application. This problem is often 
encountered in the Semantic Web, which attempts 
to view the entire Internet as a unified database. A 
fundamental step in solving the heterogeneous data 
problem is the production of an alignment that can 
tell the client how to equate the descriptions of two 
heterogeneous data sources. 

An ontology alignment may be informally de- 
scribed as a set of correspondences between seman- 
tically related terms in two heterogeneous input on- 
tologies. Each correspondence is qualified with a 
confidence level in the interval [0, 1]. Alignment is useful in data
integration tasks dealing with what is sometimes 
referred to as the semantic heterogeneity problem. 
It helps in the automation of various important
tasks, the most important of which is schema merging,
enabling the knowledge and data expressed in the
input ontologies to interoperate.

Ontology alignment algorithms are typically 
aggregations of multiple basic techniques, which 
may be classified as lexical, semantic, structural,
etc. In this paper, we consider the structural
technique of Similarity Flooding and present an 
improvement to that technique. 

The rest of this paper is organized as follows: the 
remainder of this section will cover some of the 
background material related to the Semantic Web 
and ontologies, the cases for fully automated align- 
ment, and some of the background material on the 
statistical methods used in this paper; Section 3
outlines the approach and implementation details;
the evaluation and benchmarks are explained in
Section 4; results are presented in Section 5; and
conclusions and possible future work are outlined
in Section 6.

1.1. Semantic Web and Ontologies 

First, we define some key terms around the Se- 
mantic Web and Ontology Alignment. 

A knowledge base is a kind of database that is 
designed for decision support systems and expert 
systems. It specifically allows machines to perform 
deductive reasoning over its elements. For exam- 
ple, given the instance data "John is a farmer" and 
"Farmers wear overalls", a knowledge base rea- 
soner could deduce the new information that "John 
wears overalls". 

Knowledge bases (KBs) consist of entities and 
relations between those entities. They are often
represented as sets of triples, where each triple
consists of two entities and a relationship between
those entities. For example, the triple
{teachers, lesson_plans, write} defines the
relationship that "teachers write lesson_plans".

Ontologies describe the relationships between 
elements in a knowledge base. Similar to the no- 
tion of schemas in relational databases, ontologies 
specify the structural relationships amongst the en- 
tities and classes of the ontology. 

Frequently, queries must be performed across 
multiple knowledge bases. This is the case when 
departments in a large enterprise maintain their 
own knowledge bases, but must also share in- 
formation with other departments. This also hap- 
pens in scientific computing, especially in the 
more domain-specific Informatics areas such as 
Bioinformatics and Energy Informatics [IS]. It is 
also a fundamental requirement of the Semantic 
Web; users perform queries against the web, and 
those queries must be performed against knowl- 
edge bases that may belong to entities on oppo- 
site sides of the globe. This leads us to the problem 
of heterogeneous data. When knowledge bases are 
maintained by different organizations, the ontolo- 
gies used to describe them will usually be differ- 
ent. For instance, one ontology may keep track of 
'cars' while another keeps track of 'automobiles' 
and still another keeps track of 'autoProducts'. In 
fact, the list could go on and on. 

Ontology Alignment is the process of equating 
two heterogeneous ontologies by finding valid cor- 
respondences between their sets of elements. For 
instance, an alignment between an auto parts man- 
ufacturer and an auto dealership might tell us with 
85% certainty that "anti_lock_brakes" and "ABS" 
represent the same thing. 

This is important because it allows a query man- 
agement system to translate the terminology of a 
user's query into the terminology used by many 
different ontologies and thereby to query heteroge- 
neous knowledge bases. 



1.2. Why do we need fully automated alignment?

Fully-automated alignment techniques (i.e., on- 
tology alignment performed without any input or 
approval from a human user) represent the ideal 
scenario for many of the use cases already dis- 
cussed. For example, in a Semantic Web query, the 
details of equating ontology elements across het- 
erogeneous KBs should remain completely invis- 
ible to the user, who simply wants to know the 
show-times of movies in his or her zip code, which 
star actors with a maximum Kevin-Bacon-Distance 
of three [6]. On the other hand, in an Enterprise
Integration context much of the information to be 
integrated is extremely domain specific and even 
organization specific, and possibly only known by 
a handful of organization veterans. In such a case, a 
fully-automated alignment process may not be en- 
tirely feasible. 

In any case, fully-automated alignment is an as- 
yet-unattained goal. In this paper we present an 
improvement to Similarity Flooding, a structural 
alignment technique. However, both the original al- 
gorithm and our improvement rely on comparison 
against an ideal alignment, produced by a human, 
for evaluation of the quality of the alignment. For 
our evaluation, we have used the Semantic Evalu- 
ation At Large Scale (SEALS) automated testing 
platform, which subjects an algorithm to a battery 
of alignment tests and produces precision and recall
scores based on a comparison of its results against 
an ideal alignment result. 

2 RELATED WORK 

Although the problem of fully-automated ontol- 
ogy alignment is far from being solved, there has 
been much work accomplished around fundamen- 
tal techniques for element matching and aggrega- 
tion of those matching techniques [5]. One class of 
approaches attempts to find element matches based 
on the relative similarity or dissimilarity of the
actual labels given to those elements. Some examples
of these string-matching techniques are the
Jaro-Winkler measure [jTSl, the Levenshtein
Distance [[TTI, and Latent Semantic Indexing [2]. While
useful in many situations, string-based matching 
techniques suffer from a common shortcoming; 
similar real-world objects often have very dissimi- 
lar names. The words "car" and "automobile" pro- 
vide a good example. 

Another class of techniques makes use of outside 
resources as aids in the search for good matches 
between elements. Such outside resources include 
dictionaries or taxonomies such as the WordNet 
taxonomy [10]. This class of techniques attempts
to produce correspondences between elements
that likely refer to the same real-world
objects, but that have very dissimilar names, such 
as "car" and "automobile". One example of these
semantic approaches is information-theoretic
similarity [8]. The use of outside taxonomies can
greatly improve the quality of matching results, but 
such outside resources are not always readily avail- 
able. 

Yet a third class of matching techniques attempts 
to exploit the structure of the ontology itself. For 
example, Similarity Flooding [9] takes a directed
graph representation of an ontology and uses neigh- 
bor relations between the elements to find match- 
ing correspondences between them. The idea is 
that if two elements in heterogeneous ontologies 
are very similar, then their neighboring elements 
should also be very similar. 

The Similarity Flooding algorithm operates by 
producing a new graph that represents relationships 
amongst the entities in each input ontology. The al- 
gorithm first takes the cross-product of all nodes 
in both ontologies, producing a single node in the 
result graph for each pairing of nodes in the in- 
put ontologies. Edges in the result graph are pro- 
duced if and only if the original nodes from the in- 
put ontologies both shared an edge. This results in a 
Pairwise Connectivity Graph. Finally, weights are 
added to the edges such that all outbound edges from a
given node have equal weight, and sum to one. The 
result is a Propagation Graph. 



Once the Propagation Graph has been generated, 
the similarity score of each node is generated as 
follows: each node is assigned an arbitrary initial 
similarity score which is refined through an iter- 
ative fix-point computation. At each iteration the 
new similarity is equal to the old similarity plus the 
weighted sum of the similarity of all its neighbors 
in the propagation graph. This fix-point computa- 
tion proceeds until the new similarity converges 
to a fixed value. Similarity flooding has been use- 
ful as a foundation in other structure-based match- 
ing techniques such as anchor flooding, but suf- 
fers from a few limitations. First, it requires that 
the edges of the edge-labeled graph representa- 
tion have identically named labels. In the case that 
corresponding edge labels do not have exactly the 
same name, but mean the same thing (for example 
"hasA" and "has_a") that information is completely 
lost to the flooding algorithm. 

3 APPROACH & IMPLEMENTATION 

The following sections outline and present the 
details about the model as well as many of the im- 
plementation details. 

3.1. Levenshtein Edit Distance 

The Levenshtein distance is a string similarity 
metric for measuring the difference between two 
character sequences. The distance between two se- 
quences is equal to the number of single-character 
operations required to transform one sequence into 
the other Wfj . The single-character operations are 
insertion, deletion, and substitution. 
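
As an illustration of the distance just described, the following is a minimal sketch of the standard dynamic-programming formulation. The function name `levenshtein` is chosen here for illustration and is not taken from the implementation evaluated in this paper.

```python
def levenshtein(a: str, b: str) -> int:
    """Count the single-character insertions, deletions, and
    substitutions needed to turn string a into string b."""
    # previous[j] is the distance between the prefix of a processed
    # so far and the first j characters of b.
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]  # turning i characters into "" takes i deletions
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


print(levenshtein("hasA", "has_a"))
```

The labels "hasA" and "has_a" differ by one insertion and one substitution, so they are close in edit distance even though they are not identical strings.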

Definition 1. Let x and y be terms in a single
ontology. The label set $\Lambda(x, y)$ is the set of
all labels of the hierarchical properties and object
properties that have x as their domain and y as
their range.

Figure 1. Levenshtein Edit Distance

Figure 2. Edit Similarity Distance

3.2. Edge Confidence

These lexical similarity techniques give us a way
of assessing the similarity of two entities regardless
of any structural relationships they may have
within an ontology. Recall that one limitation of
the Similarity Flooding technique is the necessity 
that predicates (edge labels when the ontology is 
represented as a graph) must have names that cor- 
respond exactly. But this is an unrealistic require- 
ment. It is easy to imagine that two unrelated on- 
tologies might, for instance, use predicates labeled 
"appearsin" and "actsin" to describe the relation- 
ship that a certain actor has been in a movie (possi- 
bly with Kevin Bacon). 

We propose using lexical similarity to quantify 
a degree of similarity between predicate labels. If
the similarity is above some arbitrary threshold, we 
will consider the two predicates to mean the same 
thing. In this way, we can include more edges in the 
propagation graph, which provides more informa- 
tion about structural relationships to the alignment 
algorithm. 
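
The thresholding idea can be sketched as follows. This is only an illustration: the normalization used in `edit_similarity` (one minus the edit distance divided by the length of the longer label) is an assumption, as are the function names and the threshold values in the usage lines.

```python
def levenshtein(a: str, b: str) -> int:
    """Standard dynamic-programming edit distance."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,
                               current[j - 1] + 1,
                               previous[j - 1] + cost))
        previous = current
    return previous[-1]


def edit_similarity(a: str, b: str) -> float:
    """Assumed normalization: 1.0 for identical labels, 0.0 when
    every character of the longer label has to change."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(a, b) / longest


def edge_confidence(label_a: str, label_b: str, threshold: float) -> float:
    """Return the label similarity if it exceeds the threshold, else 0."""
    similarity = edit_similarity(label_a, label_b)
    return similarity if similarity > threshold else 0.0


print(edge_confidence("hasA", "has_a", 0.5))  # above the threshold, kept
print(edge_confidence("hasA", "has_a", 0.7))  # below the threshold, dropped
```

With a confidence of zero the pair of predicates contributes no edge, which matches the behavior of exact label matching when the threshold is set very high.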

Next we are faced with the problem of assigning 
an actual value to the edge similarity. In the propa- 
gation graph for similarity flooding, each outbound 
edge from a node is given the same weight as all of 
the other edges from that node, and those weights 
all sum to one. This allows the graph to be
described by a row-stochastic matrix. We would like
to keep that row-stochastic property for our edge
confidence implementation, but we would also like
to have different weights on each outbound edge, 
proportionate to the degree of similarity (and thus 
the confidence) shared by the ontology predicates 
represented by the edge in the propagation graph. 

To this end, we start with a dissimilarity metric, 
such as one provided by the Levenshtein distance 
algorithm. Next we derive a complement for this 
weight by taking its difference from the sum of all 
outbound edge weights for a particular node. Fi- 
nally, the edge weight is given as the ratio of that 
complement, and the sum of all complements for 
outbound edges of a particular node. 
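
The three steps above can be sketched for the outbound edges of a single node as follows. The function name `complement_weights` and the uniform fallback for the degenerate case (a single edge, or all distances equal to zero) are assumptions made for this illustration.

```python
def complement_weights(distances):
    """Turn dissimilarity scores for one node's outbound edges into
    weights that sum to one, with smaller distances getting larger
    weights."""
    if not distances:
        return []
    total = sum(distances)
    # Complement of each distance with respect to the node's total.
    complements = [total - d for d in distances]
    norm = sum(complements)
    if norm == 0:
        # Assumed fallback: with one edge (or all-zero distances) the
        # complements vanish, so distribute the weight uniformly.
        return [1.0 / len(distances)] * len(distances)
    return [c / norm for c in complements]


weights = complement_weights([1, 2, 3])
print(weights, sum(weights))
```

For distances 1, 2, and 3 the complements are 5, 4, and 3, so the most similar edge receives the largest share of the node's outbound weight.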

This concept is formalized in the definitions that 
follow. 

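Definition 2. Let $\gamma$ be some threshold. Edge
confidence $\Gamma(\alpha, \beta)$ is the similarity
score between the labels of two edges $\alpha$ and
$\beta$ if that similarity is greater than $\gamma$,
otherwise 0. That is,

$$\Gamma(\alpha, \beta) = \begin{cases} \mathrm{sim}(\alpha, \beta) & \text{if } \mathrm{sim}(\alpha, \beta) > \gamma \\ 0 & \text{otherwise} \end{cases}$$

where $\mathrm{sim}(\alpha, \beta)$ denotes the similarity
score between the labels of $\alpha$ and $\beta$.
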
3.3. Unnormalized Pairwise Markov Chain

In the definitions that follow, consider the input 
ontologies presented in Figure 3.



A Markov Chain is a mathematical model that 
represents a system as a set of states and a set 
of probabilistic transitions between those states. 
Most importantly, Markov processes adhere to the 
Markov Property, that is, a system's next state de- 
pends only on its current state, and not on any of 
its previous states. In that sense, Markov processes 
are called "memoryless" yj. 

Definition 3. An Unnormalized Pairwise
Markov Chain (UPMC) is a not-necessarily stochastic
Markov Chain that satisfies the following property.
For every pairwise grouping of ontological terms
between the two input ontologies, there exists a
transition in the UPMC

$(x, y) \rightarrow (x', y')$

with probability $\Gamma(\Lambda(x, y), \Lambda(x', y'))$ if and only if

• there exists an edge from x to x', and

• there exists an edge from y to y'.

The creation of a UPMC creates a directed graph 
between pairs of concepts that relate pairs based on 
their structural similarity. The idea is that a pair of 
concepts are more likely to be similar if they are 
structurally similar. Each pair of ontological terms 
is connected to other pairs of ontological terms via 
directed edges that are proportional to similarity of 
the edges that exist in the input ontologies. 
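
A minimal sketch of this construction is given below. It assumes each ontology is supplied as a list of (source, label, target) triples and that the confidence of a transition is obtained by comparing the label of the edge from x to x' with the label of the edge from y to y'. The function `build_upmc`, the stand-in `exact_match` confidence function, and the toy ontologies are all hypothetical.

```python
from itertools import product


def build_upmc(edges_a, edges_b, confidence):
    """Pair every edge of ontology A with every edge of ontology B and
    record a transition (x, y) -> (x2, y2) weighted by the confidence
    of the two edge labels. Pairs with zero confidence are omitted."""
    transitions = {}
    for (x, label_a, x2), (y, label_b, y2) in product(edges_a, edges_b):
        weight = confidence(label_a, label_b)
        if weight > 0:
            key = ((x, y), (x2, y2))
            # Keep the strongest weight if several edge pairs collide.
            transitions[key] = max(weight, transitions.get(key, 0.0))
    return transitions


def exact_match(label_a, label_b):
    """Hypothetical stand-in: exact label matching, as in plain
    Similarity Flooding."""
    return 1.0 if label_a == label_b else 0.0


ontology_a = [("Teacher", "writes", "LessonPlan")]
ontology_b = [("Instructor", "writes", "Syllabus"),
              ("Instructor", "teaches", "Course")]
upmc = build_upmc(ontology_a, ontology_b, exact_match)
print(upmc)
```

Passing a thresholded lexical similarity instead of `exact_match` would admit additional transitions for labels that are similar but not identical.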



3.4. Normalized Pairwise Markov Chain

In order to take advantage of the convergence
properties outlined earlier in this paper, the UPMC
needs to have its probability transition matrix
converted into a row stochastic form. This is done by
normalizing the row sums of the matrix such that
they sum to one.

Definition 4. A Normalized Pairwise Markov
Chain (NPMC) is defined as the Markov Chain
generated from a UPMC by normalizing the row
sums of the matrix such that they sum to one. Let
P be the 1-step probability transition matrix for a
UPMC. First, determine the sum of the current
reciprocal row sums.

Then, add the reciprocal row sums to each value
in the matrix, storing each value in a temporary
matrix T.

Normalize each value by dividing each value in T
by its row sum.

The matrix that results from applying the
transformation described in Definition 4 is row stochastic
(its rows sum to one). When viewed as the 1-step
probability transition matrix for a Markov Chain,
it is easy to see that the weights on the outgoing
edges for each state sum to one. After the
transformation, an NPMC represents a Markov Chain where
each pair of ontological terms has a certain
probability of being related to other pairs in the chain in
proportion to the edge confidences that were
calculated earlier.
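
The final step of Definition 4, dividing each value by its row sum, can be sketched as follows. This illustration covers only that last step and does not attempt the reciprocal row sum adjustment; the function name `normalize_rows` and the choice to leave all-zero rows unchanged are assumptions.

```python
def normalize_rows(matrix):
    """Divide every entry by the sum of its row so that each
    non-zero row sums to one."""
    normalized = []
    for row in matrix:
        total = sum(row)
        if total == 0:
            # Assumed handling: a state with no outbound transitions
            # is left as an all-zero row.
            normalized.append(list(row))
        else:
            normalized.append([value / total for value in row])
    return normalized


print(normalize_rows([[1, 3], [0, 0], [2, 2]]))
```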



3.5. Iterative Approach 

One approach to finding the stationary distribution
of the NPMC is to compute the limiting probabilistic
state $\lim_{t \to \infty} \pi^{(t)}$ in an iterative
fashion. Let $\epsilon$ be some threshold.

The values for the initial similarity distribution
$\pi^{(0)}$ are taken from the non-structural similarity
scores for the ontological terms corresponding to each
state.
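
A minimal sketch of the iterative approach is shown below. It assumes the usual Markov Chain update, in which the next distribution is the current distribution multiplied by P, and it assumes that iteration stops once the largest per-state change falls below the threshold. The function names, the iteration cap, and the example matrix are illustrative only.

```python
def step(pi, P):
    """One update: multiply the row vector pi by the matrix P."""
    n = len(P)
    return [sum(pi[i] * P[i][j] for i in range(n)) for j in range(n)]


def iterate_to_fixpoint(pi0, P, epsilon=1e-9, max_iterations=10_000):
    """Repeat the update until successive distributions differ by
    less than epsilon in every component (assumed stopping rule)."""
    pi = list(pi0)
    for _ in range(max_iterations):
        following = step(pi, P)
        if max(abs(a - b) for a, b in zip(following, pi)) < epsilon:
            return following
        pi = following
    return pi


P = [[0.9, 0.1],
     [0.5, 0.5]]
print(iterate_to_fixpoint([0.5, 0.5], P))
```

For this two-state example the exact stationary distribution is (5/6, 1/6), which the iteration approaches from any initial distribution.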
Figure 3. Example Input Ontologies with Object Properties



3.6. Steady-State Approach 

Another approach to finding the stationary distribution
of the NPMC is to compute the limiting probabilistic
state $\lim_{t \to \infty} \pi^{(t)}$ directly. This can be
done by solving a left eigenvalue problem:

$\pi = \pi P \Rightarrow \pi (P - I) = 0$

where the eigenvalue is 1 and I is the identity
matrix. Simply solve for $\pi$ by computing the left
nullspace of the P - I matrix (appropriately sliced)
and then normalizing $\pi$ so that $\lVert \pi \rVert = 1$. In order to
accomplish this in an easy fashion, the ScalaTion
library was used [11]. ScalaTion is a Domain-Specific
Embedded Language (DSEL) for Modeling & Simulation,
written in the Scala programming language [13].
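
The direct computation can be illustrated without any library support. The sketch below is not the ScalaTion-based implementation; it is a plain Gaussian elimination written for illustration, and it assumes that the normalization requires the entries of the distribution to sum to one. The function name `stationary_distribution` is hypothetical.

```python
def stationary_distribution(P):
    """Solve pi (P - I) = 0 together with sum(pi) = 1."""
    n = len(P)
    # Row j of A is the j-th equation of pi (P - I) = 0.
    A = [[P[i][j] - (1.0 if i == j else 0.0) for i in range(n)]
         for j in range(n)]
    b = [0.0] * n
    # The equations are linearly dependent, so the last one is
    # replaced by the normalization constraint.
    A[-1] = [1.0] * n
    b[-1] = 1.0
    # Gaussian elimination with partial pivoting.
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(A[r][col]))
        if abs(A[pivot][col]) < 1e-12:
            raise ValueError("no unique stationary distribution")
        A[col], A[pivot] = A[pivot], A[col]
        b[col], b[pivot] = b[pivot], b[col]
        for r in range(col + 1, n):
            factor = A[r][col] / A[col][col]
            for c in range(col, n):
                A[r][c] -= factor * A[col][c]
            b[r] -= factor * b[col]
    # Back substitution.
    pi = [0.0] * n
    for r in range(n - 1, -1, -1):
        known = sum(A[r][c] * pi[c] for c in range(r + 1, n))
        pi[r] = (b[r] - known) / A[r][r]
    return pi


print(stationary_distribution([[0.9, 0.1], [0.5, 0.5]]))
```

A dense solver like this is only practical for small chains; the number of states in a pairwise chain grows with the product of the sizes of the two ontologies.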

3.7. Notes on Convergence 

Since the NPMC model is a Markov Chain (MC), certain
things can be said about the alignment (stationary
distribution), such as [12]:

• If the stochastic probability transition matrix P
is symmetric then the MC has a unique
alignment/stationary distribution.

• If P is irreducible (but not necessarily aperiodic),
then for any $0 < \alpha < 1$, the matrix
$P' = \alpha P + (1 - \alpha) I$ is stochastic,
irreducible and aperiodic, and has the same
stationary distribution as P. Note: I is the identity
matrix. Since P' is finite, irreducible, and
aperiodic, it is also ergodic and therefore has a
unique stationary distribution. If we can generate
such a P' from our NPMC, then we know it has a unique
alignment, according to our model.
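
The effect of mixing P with the identity matrix can be seen on a small periodic chain. The sketch below is illustrative only; the function names and the two-state example are not taken from the evaluated implementation.

```python
def lazy_chain(P, alpha):
    """Return alpha * P + (1 - alpha) * I for 0 < alpha < 1."""
    if not 0 < alpha < 1:
        raise ValueError("alpha must lie strictly between 0 and 1")
    n = len(P)
    return [[alpha * P[i][j] + (1 - alpha) * (1.0 if i == j else 0.0)
             for j in range(n)]
            for i in range(n)]


def step(pi, P):
    """One update: multiply the row vector pi by the matrix P."""
    n = len(P)
    return [sum(pi[i] * P[i][j] for i in range(n)) for j in range(n)]


# An irreducible but periodic chain: the state flips at every step.
P = [[0.0, 1.0],
     [1.0, 0.0]]
P_lazy = lazy_chain(P, 0.5)

print(step([1.0, 0.0], P))       # the mass keeps swapping between states
print(step([1.0, 0.0], P_lazy))  # the mass spreads across both states
```

Iterating with P from a non-uniform start never settles, because the distribution alternates between two vectors, whereas iterating with the mixed matrix does settle. The uniform distribution is stationary for both matrices.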

3.8. Refining Results 

The results generated from using either the iterative
approach or the steady-state approach are only
somewhat meaningful. The output of both proce- 
dures produces a two-dimensional distribution of 
similarity scores between the two input ontologies. 
Although this satisfies the definition of an align- 
ment as presented earlier in this paper, a user is 
likely more interested in the set of similarities that 
yield the best correspondence between the two in- 
put ontologies. 

Take the alignment distribution generated by either
the iterative or steady-state approach and decompose
it into an m × n matrix. Consider this
matrix to be representative of a weighted bipartite
graph. Now, finding the set of similarities that yield
the best correspondence between the two ontologies
is simply a matter of solving the maximum-weighted
bipartite graph matching problem using
the generated graph. In order to accomplish this
task, the Hungarian algorithm is used [7].
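
To make the matching problem concrete, the sketch below solves it by brute force over all assignments. This is a stand-in for illustration and is not the Hungarian algorithm: it returns the same optimal matching on small inputs but takes factorial time. The function name `best_matching` and the example scores are hypothetical, and the sketch assumes the matrix has at most as many rows as columns.

```python
from itertools import permutations


def best_matching(scores):
    """Return the list of (row, column) pairs with the highest total
    score, using each row once and each column at most once."""
    m, n = len(scores), len(scores[0])
    if m > n:
        raise ValueError("expects at most as many rows as columns")
    best = max(permutations(range(n), m),
               key=lambda cols: sum(scores[i][cols[i]] for i in range(m)))
    return list(enumerate(best))


scores = [[0.2, 0.9, 0.1],
          [0.8, 0.85, 0.3]]
print(best_matching(scores))
```

In this example both rows score highest against the middle column, so choosing the best column for each row independently would produce a conflict; the matching resolves it by maximizing the total.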

4 EVALUATION 

In order to evaluate the model presented in this 
paper, the bibliographic ontology benchmark pro- 
vided by the Ontology Alignment Evaluation Ini- 
tiative (OAEI) was used. This benchmark test li- 
brary consists of data sets that are built from 
the bibliographic reference ontology. Each test in- 
cludes a reference alignment in order to facilitate 
the calculation of precision and recall. 



In order to evaluate the ontology alignment mod- 
els presented in this paper, the Semantic Evaluation 
At Large Scale (SEALS) Platform was used [4, 16].
The SEALS Platform is an extensible infrastruc- 
ture that facilitates the remote evaluation of vari- 
ous semantic technologies. In our evaluation, the 
SEALS Platform was used to easily compare the 
models presented in this paper using the OAEI 
benchmark. We compared the reference implemen- 
tation of Similarity Flooding available from Stan- 
ford University to our modified version with the 
Edge Confidence implementation. 

5 RESULTS 

We compared our Edge Confidence algorithm to 
the stock Similarity Flooding algorithm using the 
SEALS Benchmark 1 version 2.0 suite of tests. 
The SEALS platform subjects each algorithm to a 
battery of alignment tests, then compares the re- 
sult alignment generated by the algorithm to an 
ideal alignment, which is included as part of the 
suite. For each alignment test, the platform pro- 
duces scores for precision and recall. 

Precision (Definition 5) is defined as the ratio of the
number of correct correspondences to the total
number of correspondences returned by the algorithm.
It should be noted that precision is mainly
a penalty against false positives, with no penalty
against false negatives. For example, if there were
100 valid correspondences between two ontologies,
a given algorithm could return only a single
correspondence and, if that correspondence were
correct, would score 100% precision.

Definition 5. Precision

$$\text{precision} = \frac{|\{\text{valid}\} \cap \{\text{returned}\}|}{|\{\text{returned}\}|}$$

The converse of the precision metric is the recall
metric (Definition 6), which provides a penalty for false
negatives, but not for false positives. Recall considers
the ratio of the correct correspondences to the
total number of correspondences that should have
been returned by the algorithm.

Definition 6. Recall

$$\text{recall} = \frac{|\{\text{valid}\} \cap \{\text{returned}\}|}{|\{\text{valid}\}|}$$

Finally, we consider the F-measure (Definition 7), a
metric that provides a balance between precision
and recall [14].

Definition 7. F-measure

$$F = 2 \cdot \frac{\text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}$$
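
The three metrics can be computed directly from sets of correspondences, as in the following sketch. The function names, the convention of returning zero when a denominator is empty, and the example correspondences are assumptions for illustration; the SEALS platform computes these scores itself.

```python
def precision(valid, returned):
    """Share of returned correspondences that are valid."""
    return len(valid & returned) / len(returned) if returned else 0.0


def recall(valid, returned):
    """Share of valid correspondences that were returned."""
    return len(valid & returned) / len(valid) if valid else 0.0


def f_measure(valid, returned):
    """Balance of precision and recall as given in Definition 7."""
    p, r = precision(valid, returned), recall(valid, returned)
    return 2 * p * r / (p + r) if p + r else 0.0


valid = {("car", "automobile"), ("ABS", "anti_lock_brakes"),
         ("teacher", "instructor"), ("write", "author")}
returned = {("car", "automobile"), ("ABS", "anti_lock_brakes"),
            ("car", "teacher")}
print(precision(valid, returned), recall(valid, returned),
      f_measure(valid, returned))
```

Here two of the three returned correspondences are valid and two of the four valid correspondences are found, so precision is 2/3, recall is 1/2, and the F-measure is 4/7.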



When comparing Similarity Flooding and Edge
Confidence on the basis of precision, Similarity
Flooding typically outperforms Edge Confidence,
as can be seen in Figure 4. This reflects the
conservative nature of Similarity Flooding. Because
Edge Confidence is more aggressive about includ- 
ing edges, it generates a larger propagation graph, 
which leads to more correspondences on average, 
although this will typically include more incorrect 
results, thus lowering the precision score. 

By the same reasoning, Edge Confidence typically
outperforms Similarity Flooding on the basis of
recall. Because Edge Confidence returns more results, it
has more correct results, which gives a higher recall
score. This is evident in Figure 5.

The real question is: does the more aggressive
approach pay off in the F-measure score, or does
it incur too much of a penalty because of its lack
of precision? Figure 6 shows that the Edge Confidence
technique performs very well against Similarity
Flooding, performing orders of magnitude better
on many of the tests.

6 CONCLUSIONS & FUTURE WORK 

In this paper, we have shown an improvement 
to the popular Similarity Flooding technique for 
producing ontology alignment. Our evaluation on 
the SEALS platform has shown promising results. 
While the overall scores for both algorithms are 
very low, it should be noted that these algorithms 
are not intended for stand-alone use, but rather as 



building blocks in more sophisticated suites of on- 
tology alignment tools. There is also more work to 
be done in the future. One avenue for improvement 
is to adjust the definition of edge confidence so 
that it includes information about the structural re- 
lationships between different object properties. For 
example, object properties have parent-child rela- 
tionships similar to ontological terms. A new edge 
confidence function could take advantage of these 
relationships in order to consider a more structural 
approach to the alignment problem. 